class: center, middle, inverse, title-slide

.title[
# Module 2: Data Wrangling
]
.subtitle[
## Introduction to Tools of the Trade in Data Analysis
]
.author[
### Dr. Christopher Kenaley
]
.institute[
### Boston College
]
.date[
### 2025/9/8
]

---
class: inverse, top
# In class today

<!-- Add icon library -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.14.0/css/all.min.css">

.pull-left[
Today we'll ....

## Tidy data

- Shape of data (long vs wide)
- Pivoting
- Joining and relational data
- Peek under the hood of Module Project 2

]

---
class: inverse, top
<!-- slide 1 -->

## Shape of data (wide)

Wide: one row per sample; many columns for measurements

`iris` ships in wide form (4 measurement columns + `Species`)

Pros:

- Easy to read raw values
- Convenient for direct modeling/viz with specific columns

``` r
head(iris, 6)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
```

---
class: inverse, top
<!-- slide 1 -->

## Shape of data (long)
Long: many rows per sample; fewer columns for measurements

Pivoting `iris` gives the long form (the 4 measurements are stacked in one `value` column, labeled by `name`)

Pros:

- Tidy structure enables grouped summaries, faceting, and tidy verbs (functions)

``` r
library(tidyverse)  # provides %>%, pivot_longer(), and ggplot()

# Stack the four measurement columns into name/value pairs
iris_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width)

iris_long %>% head(6)
```

```
## # A tibble: 6 × 3
##   Species name         value
##   <fct>   <chr>        <dbl>
## 1 setosa  Sepal.Length   5.1
## 2 setosa  Sepal.Width    3.5
## 3 setosa  Petal.Length   1.4
## 4 setosa  Petal.Width    0.2
## 5 setosa  Sepal.Length   4.9
## 6 setosa  Sepal.Width    3
```

---
class: inverse, top
<!-- slide 1 -->

``` r
# Long data lets one aesthetic (name) cover all four measurements
iris_long %>%
  ggplot(aes(x = name, y = value, fill = Species)) +
  geom_boxplot() +
  facet_grid(. ~ Species) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
```

<img src="3140_f25_9-08_files/figure-html/unnamed-chunk-4-1.png" height="30%" style="display: block; margin: auto;" />

---
class: inverse, top
<!-- slide 1 -->

## Merging/joining data

Let's simulate some pH data: one random value per species and measurement

``` r
# One uniform draw between 5 and 8 for each Species x name group
pH <- iris_long %>%
  group_by(Species, name) %>%
  reframe(pH = runif(1, 5, 8))
```

---
class: inverse, top
<!-- slide 1 -->

## Merging/joining data

Join on the columns the two tables share (`Species` and `name`)

``` r
iris_long %>% left_join(pH)
```

```
## Joining with `by = join_by(Species, name)`
```

```
## # A tibble: 600 × 4
##    Species name         value    pH
##    <fct>   <chr>        <dbl> <dbl>
##  1 setosa  Sepal.Length   5.1  5.52
##  2 setosa  Sepal.Width    3.5  5.54
##  3 setosa  Petal.Length   1.4  5.23
##  4 setosa  Petal.Width    0.2  5.46
##  5 setosa  Sepal.Length   4.9  5.52
##  6 setosa  Sepal.Width    3    5.54
##  7 setosa  Petal.Length   1.4  5.23
##  8 setosa  Petal.Width    0.2  5.46
##  9 setosa  Sepal.Length   4.7  5.52
## 10 setosa  Sepal.Width    3.2  5.54
## # ℹ 590 more rows
```

---
class: inverse, top
<!-- slide 1 -->

## Merging/joining data

Join on `Species` and `name`, then plot

``` r
iris_long %>%
  left_join(pH) %>%
  ggplot(aes(pH, value, col = name)) +
  geom_point(alpha = 0.1) +
  facet_grid(. ~ Species)
```

```
## Joining with `by = join_by(Species, name)`
```

<img src="3140_f25_9-08_files/figure-html/unnamed-chunk-7-1.png" height="30%" style="display: block; margin: auto;" />
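
---
class: inverse, top

## Merging/joining data

The joins on the previous slides let `left_join()` choose the keys, which is why the `Joining with ...` message appears. Below is a minimal sketch of the same join with the keys named explicitly, plus two sanity checks. It assumes dplyr 1.1.0 or later (for `join_by()` and `reframe()`) and tidyr are installed. The `set.seed()` call and the name `iris_pH` are additions for this sketch only, so the simulated pH values will differ from those shown earlier.

``` r
library(dplyr)  # left_join(), join_by(), group_by(), reframe()
library(tidyr)  # pivot_longer()

set.seed(1)  # assumption: added only so this sketch is repeatable

# Reshape iris from wide to long, as on the earlier slide
iris_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width)

# One simulated pH value per Species x measurement combination
pH <- iris_long %>%
  group_by(Species, name) %>%
  reframe(pH = runif(1, 5, 8))

# Name the keys explicitly instead of relying on the natural join
iris_pH <- iris_long %>%
  left_join(pH, by = join_by(Species, name))

# A left join against unique keys should keep the left table's row count
stopifnot(nrow(iris_pH) == nrow(iris_long))

# Every row of iris_long should have found a matching pH value
stopifnot(!anyNA(iris_pH$pH))
```

Naming the keys silences the message and makes the join fail loudly if a key column is later renamed, rather than silently joining on whatever columns happen to match.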